Abstract

The string matching problem is one of the most studied problems
in computer science. While it is very easily stated and many of the
simple algorithms perform very well in practice, numerous works have
been published on the subject and research is still very active. In this
paper we survey recent results on parallel algorithms for the string
matching problem.


1  Introduction 

You are given a copy of the Encyclopedia Britannica and a word, and are
requested to find all the occurrences of the word. This is an instance of
the string matching problem. More formally, the input to the string
matching problem consists of two strings TEXT[1..n] and PATTERN[1..m];
the output should list all occurrences of the pattern string in the text
string. The symbols in the strings are chosen from some set which is
called an alphabet. The choice of the alphabet sometimes allows us to
solve the problem more efficiently, as we will see later.

A naive algorithm for solving the string matching problem can proceed as
follows: Consider the first n − m + 1 positions of the text string.
Occurrences of the pattern can start only at these positions. The
algorithm checks each of these positions for an occurrence of the pattern.
Since it can take up to m comparisons to verify that there is actually an
occurrence, the time complexity of this algorithm is O(nm). Note that the
only operations involving the input strings in this algorithm are
comparisons of two symbols.
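The naive algorithm can be sketched as follows. This is only an illustration: the function name naive_match and the convention of returning 1-based positions are our own assumptions, chosen to agree with the TEXT[1..n] notation.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return all 1-based positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    occurrences = []
    # Only the first n - m + 1 positions can start an occurrence.
    for start in range(n - m + 1):
        k = 0
        # Up to m symbol comparisons per position; equality tests only.
        while k < m and text[start + k] == pattern[k]:
            k += 1
        if k == m:
            occurrences.append(start + 1)
    return occurrences
```

Only equality comparisons between symbols are used, so the sketch works under the general alphabet assumption discussed next.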

The only assumption we made about the alphabet in the algorithm described
above is that alphabet symbols can be compared and the comparison results
in an equal or unequal answer. This assumption, often referred to as the
general alphabet assumption, is the weakest assumption we will have on
the alphabet, and, as we have seen, is sufficient to solve the string
matching problem. However, although the definition of the string matching
problem does not require the alphabet to be ordered, an arbitrary order is
exploited in several algorithms [3, 4, 21, 22] which make use of some
combinatorial properties of strings over an ordered alphabet [40]. This
assumption is reasonable since the alphabet symbols are encoded
numerically, which introduces a natural order. Other algorithms use a more
restricted model where the alphabet symbols are small integers. Those
algorithms usually take advantage of the fact that symbols can be used as
indices of an array [2, 6, 42, 48] or that many symbols can be packed
together in one register [26]. This case is usually called fixed alphabet.

Many sequential algorithms exist for the string matching problem and are
widely used in practice. The better known are those of Knuth, Morris and
Pratt [35] and Boyer and Moore [12]. These algorithms achieve O(n + m)
time, which is the best possible in the worst case, and the latter
algorithm performs even better on average. Another well known algorithm,
discovered by Aho and Corasick [2], searches for multiple patterns over a
fixed alphabet. Many variations on these algorithms exist and an excellent
survey paper by Aho [1] covers most of the techniques used.

All these algorithms use O(m) auxiliary space. At one time it was known
that a logarithmic space solution was possible [28], and the problem was
conjectured to have a time-space trade-off [10]. This conjecture was later
disproved when a linear-time constant-space algorithm was discovered [29]
(see also [21]). It was shown that even a 6-head two-way finite automaton
can perform string matching in linear time. It is still an open problem
whether a k-head one-way finite automaton can perform string matching. The
only known cases are for k = 1, 2, 3 [30, 38, 39], where the answer is
negative.

Recently, a few papers have been published on the exact complexity of the
string matching problem, namely the exact number of comparisons necessary
in the case of a general alphabet. Surprisingly, the upper bound of about
2n comparisons, the best known before [5, 35], was improved to (4/3)n
comparisons by Colussi, Galil and Giancarlo [16]. In a recent work Cole
[17] proved that the number of comparisons performed by the original
Boyer-Moore algorithm is about 3n.

In  this  paper  we  will  focus  on  parallel  algorithms  for  the  string  matching 
problem.  Many  other  related  problems  have  been  investigated  and  are  out 
of  the  scope  of  this  paper  [1,  27].  For  an  introduction  to  parallel  algorithms 
see  surveys  by  Karp  and  Ramachandran  [33]  and  Eppstein  and  Galil  [24]. 

In parallel computation, one has to be more careful about the definition
of the problem. We assume that the input strings are stored in memory and
the required output is a Boolean array MATCH[1..n] which will have a
‘true’ value at each position where the pattern occurs and ‘false’ where
there is no occurrence.


All algorithms considered in this paper are for the parallel random access
machine (PRAM) computation model. This model consists of some processors
with access to a shared memory. There are several versions of this model
which differ in their simultaneous access to a memory location. The
weakest is the exclusive-read exclusive-write (EREW-PRAM) model, where at
each step simultaneous read operations and write operations at the same
memory location are not allowed. A more powerful model is the
concurrent-read exclusive-write (CREW-PRAM) model, where only simultaneous
read operations are allowed. The most powerful model is the
concurrent-read concurrent-write (CRCW-PRAM) model, where read and write
operations can be simultaneously executed.

In the case of the CRCW-PRAM model, there are several ways in which write
conflicts are resolved. The weakest model, called the common CRCW-PRAM,
assumes that when several processors attempt to write to a certain memory
location simultaneously, they all write the same value. A stronger model,
called the arbitrary CRCW-PRAM, assumes that one of the attempted values,
chosen arbitrarily, will be written. An even stronger model, the priority
CRCW-PRAM, assumes each processor has a priority and the highest priority
processor succeeds in writing. Most of the CRCW-PRAM algorithms described
in this paper can be implemented in the common model. In fact these
algorithms can be implemented even if we assume that the same constant
value is always used in case of concurrent writes. However, to simplify
the presentation we will sometimes use the more powerful priority model.

For the algorithms discussed in this paper we assume that the length of
the text string is n = 2m, where m is the length of the pattern string.
This is possible since the text string can be broken into overlapping
segments of length 2m and each segment can be searched in parallel.
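The reduction to n = 2m can be sketched as follows. This is an illustration under our own assumptions: segments start at multiples of m, each segment is searched with a plain substring comparison standing in for any string matching algorithm, and m ≥ 1. The segments are processed in a loop here, whereas on a PRAM they would be handled simultaneously.

```python
def match_by_segments(text: str, pattern: str) -> list[bool]:
    """Fill a 0-based MATCH array by searching overlapping segments of length 2m."""
    n, m = len(text), len(pattern)
    match = [False] * n
    # Consecutive segments overlap by m symbols, so every possible start
    # of an occurrence lies, together with the m symbols that follow it,
    # inside at least one segment.
    for seg_start in range(0, n - m + 1, m):
        segment = text[seg_start:seg_start + 2 * m]
        for offset in range(len(segment) - m + 1):
            if segment[offset:offset + m] == pattern:
                match[seg_start + offset] = True
    return match
```

A start position on a segment boundary is examined twice, which is harmless because both examinations write the same value.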

Lower bounds for some basic parallel computational problems can be
applied to string matching. A lower bound of Ω(log n / log log n) for
computing the parity of n input bits on a CRCW-PRAM with any polynomial
number of processors [7] implies that one cannot count the number of
occurrences in less than Ω(log n / log log n) time. Another lower bound of
Ω(log n) for computing a Boolean AND of n input bits on any CREW-PRAM [20]
implies an Ω(log n) lower bound for string matching in this parallel
computation model.

These lower bounds make the possibility of sublogarithmic parallel
algorithms for any problem very unlikely. However, several problems are
known to have such algorithms [8, 9, 11, 36, 44, 45], including string
matching. In fact, a very simple algorithm can solve the string matching
problem in constant time using nm processors on a CRCW-PRAM: similarly to
the naive sequential algorithm, consider each possible start of an
occurrence. Assign m processors to each such position to verify the
occurrence. Verifying an occurrence is simple: perform all m comparisons
in parallel, and any mismatch changes a value of the MATCH array to
indicate that an occurrence is impossible.

The  following  theorem  will  be  used  throughout  the  paper. 

Theorem 1.1 (Brent [13]): Any PRAM algorithm of time t that consists
of x elementary operations can be implemented on p processors in
⌈x/p⌉ + t time.

Using this theorem, for example, we can slow down the constant-time
algorithm described above to run in O(s) time on nm/s processors.
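The bound of Brent's theorem is easy to evaluate for concrete parameters. The helper below is a small sketch; the name brent_time and the sample values are our own, and the constant hidden in the constant-time algorithm is taken to be t = 1 purely for illustration.

```python
def brent_time(x: int, t: int, p: int) -> int:
    """Time bound ceil(x / p) + t for x operations, parallel time t, p processors."""
    return -(-x // p) + t   # integer ceiling division avoids floating point


# Constant-time matching with m = 1024 and n = 2m performs about nm
# comparisons; on only n processors the bound becomes m + 1 steps.
m = 1024
n = 2 * m
steps_on_n_processors = brent_time(n * m, 1, n)
```

With p = nm/s processors the same formula gives s + t steps, which is where an O(s) time bound comes from.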

In the design of a parallel algorithm, one is also concerned about the
total number of operations performed, which is the time-processor product.
The best one can wish for is the number of operations performed by the
fastest sequential algorithm. A parallel algorithm is called optimal if
that bound is achieved. Therefore, in the case of the string matching
problem, an algorithm is optimal if the time-processor product is linear
in the length of the input.

An optimal parallel algorithm discovered by Galil [26] solves the problem
in O(log m) time using n/log m processors. This algorithm works for fixed
alphabet and was later improved by Vishkin [46] for general alphabet.
Optimal algorithms by Karp and Rabin [32] and other algorithms based on
Karp, Miller and Rosenberg's [31] method [23, 34] also work in O(log m)
time for fixed alphabet. Breslauer and Galil [14] obtained an optimal
O(log log m) time algorithm for general alphabet. Vishkin [47] developed
an optimal O(log* m)¹ time algorithm. Unlike the case of the other
algorithms, this time bound does not account for the preprocessing of the
pattern. The preprocessing in
Vishkin's algorithm takes O(log² m / log log m) time. Vishkin's super fast
algorithm raised the question whether an optimal constant-time algorithm
is possible. This question was partially settled in a recent paper by
Breslauer and Galil [15] showing an Ω(log log m) lower bound for parallel
string matching over a general alphabet. The lower bound proves that a
slower preprocessing is crucial for Vishkin's algorithm.

This paper is organized as follows. In Section 2 we describe the
logarithmic time algorithms. Section 3 is devoted to Breslauer and Galil's
O(log log m) time algorithm. Section 4 covers the matching lower bound.
Section 5 outlines the ideas in Vishkin's O(log* m) algorithm. In some
cases we will describe a parallel algorithm that achieves the claimed time
bound using n processors. The optimal version, using O(n/t) processors,
where t is the time bound, can be derived using standard techniques. Many
questions are still open and some are listed in the last section of this
paper.

2  Logarithmic  time  algorithms 

The simplest parallel string matching algorithm is probably the randomized
algorithm of Karp and Rabin [32]. The parallel version of their algorithm
assumes the alphabet is binary and translates the input symbols into 2 × 2
non-singular matrices. The following representation is used, which assures
a unique representation for any string as a product of the matrices
representing it:

       f(0) = | 1  0 |        f(1) = | 1  1 |
              | 1  1 |               | 0  1 |


¹The function log* m is defined as the smallest k such that log^(k) m ≤ 2,
where log^(1) m = log m and log^(i+1) m = log log^(i) m.

Most of the work in the algorithm is performed using a well known method
for parallel prefix computation summarized in the following theorem:

Theorem 2.1 (Folklore, see also [37]): Suppose a sequence of n elements
x_1, x_2, ..., x_n is drawn from a set with an associative operation *,
computable in constant time. Let p_i = x_1 * x_2 * ... * x_i, usually
called a prefix sum. Then an EREW-PRAM can compute all p_i, i = 1, ..., n,
in O(log n) time using n/log n processors.
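One standard way to obtain such a prefix computation is a pairing scheme, sketched below as a sequential simulation. The function name prefix_scan is our own, and this is only one of several folklore schemes, not necessarily the one in [37]. Within each level the applications of the operation are independent of one another, so a PRAM could perform them simultaneously; the recursion has O(log n) levels and performs O(n) operations in total.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def prefix_scan(xs: list[T], op: Callable[[T, T], T]) -> list[T]:
    """Return [x1, x1*x2, ..., x1*...*xn] for an associative operation op."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    # Level step 1: combine adjacent pairs (independent operations).
    pairs = [op(xs[2 * i], xs[2 * i + 1]) for i in range(n // 2)]
    # Recurse on the half-length sequence of pair products.
    pair_prefix = prefix_scan(pairs, op)
    # Level step 2: fill in every position (again independent operations).
    out = []
    for i in range(n):
        if i == 0:
            out.append(xs[0])
        elif i % 2 == 1:
            out.append(pair_prefix[i // 2])
        else:
            out.append(op(pair_prefix[i // 2 - 1], xs[i]))
    return out
```

The operands are always combined in left-to-right order, so the operation needs to be associative but not commutative; this matters for matrix products.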

Karp and Rabin's algorithm multiplies the matrices representing the
pattern to get a single matrix which is called the fingerprint of the
pattern. By Theorem 2.1 this can be done by an (m/log m)-processor
EREW-PRAM in O(log m) time. The text string is also converted to the same
representation and matches can be reported based only on comparison of two
matrices: the fingerprint of the pattern and the fingerprint of each text
position. To compute the fingerprint of a text position j, which is the
product of the matrix representation of the substring starting at position
j and consisting of the next m symbols, first compute all prefix products
for the matrix representation of the text and call them P_i. Then compute
the inverse of each P_i; the inverse exists since each P_i is a product of
invertible matrices. The fingerprint for a position j, 2 ≤ j ≤ n − m + 1,
is given by P_{j-1}^{-1} P_{j+m-1}; the fingerprint of the first position
is P_m. By Theorem 2.1 the prefix products also take optimal O(log m) time
on an EREW-PRAM. Since the remaining work can be done in constant optimal
time, the algorithm works in optimal O(log m) total time.

However, there is a problem with the algorithm described above. The
entries of those matrices may grow too large to be represented in a single
register, so the numbers are truncated modulo some random prime p. All
computations are done in the field Z_p, which assures that the matrices
are still invertible.

This  truncated  representation  does  not  assure  uniqueness,  but  Karp  and 
Rabin  show  that  the  probability  of  their  algorithm  erroneously  reporting  a 
nonexisting  occurrence  is  very  small  if  p  is  chosen  from  a  range  which  is 
large  enough.  This  algorithm  is  in  fact  the  only  parallel  algorithm  which 
works  in  optimal  logarithmic  time  on  an  EREW-PRAM;  all  the  algorithms 
we  describe  later  need  a  CRCW-PRAM. 
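The fingerprint computation can be sketched as follows for a binary alphabet. Several choices here are assumptions made for illustration only: the symbol matrices are the usual pair with determinant 1 (rows (1 0), (1 1) for 0 and rows (1 1), (0 1) for 1), the prime is fixed rather than random, and the prefix products are accumulated sequentially where the parallel algorithm would use a prefix computation.

```python
P = 2_147_483_647  # fixed prime 2^31 - 1 (assumption; the algorithm draws p at random)

# A matrix [[a, b], [c, d]] is stored as the tuple (a, b, c, d).
SYMBOL = {"0": (1, 0, 1, 1), "1": (1, 1, 0, 1)}
IDENTITY = (1, 0, 0, 1)


def mul(x, y):
    """Product of two 2 x 2 matrices over Z_p."""
    a, b, c, d = x
    e, f, g, h = y
    return ((a * e + b * g) % P, (a * f + b * h) % P,
            (c * e + d * g) % P, (c * f + d * h) % P)


def inv(x):
    """Inverse over Z_p; every product of symbol matrices has determinant 1."""
    a, b, c, d = x
    return (d % P, -b % P, -c % P, a % P)


def fingerprint_match(text: str, pattern: str) -> list[int]:
    """Return 1-based positions whose fingerprint equals the pattern's."""
    n, m = len(text), len(pattern)
    target = IDENTITY
    for ch in pattern:
        target = mul(target, SYMBOL[ch])
    prefix = [IDENTITY]                      # prefix[i] is P_i, with P_0 = I
    for ch in text:
        prefix.append(mul(prefix[-1], SYMBOL[ch]))
    # Fingerprint of 1-based position j is P_{j-1}^{-1} P_{j+m-1}.
    return [j for j in range(1, n - m + 2)
            if mul(inv(prefix[j - 1]), prefix[j + m - 1]) == target]
```

For short strings the matrix entries stay below p, so no false matches can occur; for long strings a reported position is only probably an occurrence, as discussed above.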

The method used by Karp, Miller and Rosenberg [31] for sequential string
matching can also be adapted for parallel string matching. Although the
original algorithm worked in O(n log n) time, Kedem, Landau and Palem [34]
were able to obtain an optimal O(log m) time parallel algorithm using a
similar method. Crochemore and Rytter [23] independently suggested a
parallel implementation in O(log m) time using n processors. Another
parallel algorithm which uses a similar method is the suffix tree
construction algorithm of Apostolico et al. [6], which can also be used to
solve the string matching problem. All these parallel algorithms need an
arbitrary CRCW-PRAM and are for fixed alphabets; they also need a large
memory space. It seems that this method cannot be used to obtain string
matching algorithms faster than O(log n); however, it is applicable to
other problems [23].

We describe a logarithmic time implementation of the Karp, Miller and
Rosenberg [31] method for an n-processor arbitrary CRCW-PRAM. Consider the
input as one string of length l = n + m which is a text of length n
concatenated with a pattern of length m. Two indices of the input string
are called k-equivalent if the substrings of length k starting at those
indices are equal; this is in fact an equivalence relation on the set of
indices of the input string. The algorithm assigns a unique name to all
the indices in the same equivalence class. The goal is to find all indices
which are in the same m-equivalence class as the index where the pattern
starts.

We denote by n(i, j) the unique name assigned to the substring of length
j starting at position i of the input string; assume n(i, j) is defined
only for i + j ≤ l + 1 and the names are integers in the range 1, ..., l.
Suppose n(i, r) and n(i, s) are known for all positions i of the input
string. One can easily combine these names to obtain n(i, r + s) for all
positions i in constant time using l processors as follows: Assume a two
dimensional array of size l × l is available; assign a processor to each
position of the input string. Note that the string of length r + s
starting at position i is actually the string of length r starting at
position i concatenated with the string of length s starting at position
i + r. Each processor will try to write the position number it is assigned
to in the entry at row n(i, r) and column n(i + r, s) of the matrix. If
more than one processor attempts to write the same entry, assume an
arbitrary one succeeds. Now n(i, r + s) is assigned the value written in
row n(i, r) and column n(i + r, s) of the matrix. That is, n(i, r + s) is
an index of the input string, not necessarily i, which is
(r + s)-equivalent to i.

The algorithm starts with n(i, 1), which is the symbol at position i of
the string, assuming the alphabet is the set of integers between 1 and l.
It proceeds with O(log m) steps computing n(i, 2), n(i, 4), ..., n(i, 2^j)
for j ≤ ⌊log₂ m⌋, by merging names of two 2^j-equivalence classes into a
name of a 2^{j+1}-equivalence class. In another O(log m) steps it computes
n(i, m) by merging a subset of the names of power-of-two equivalence
classes computed before, and reports all indices which are in the same
m-equivalence class as the starting index of the pattern.
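The naming scheme can be simulated sequentially as in the sketch below. It departs from the text in ways that are our own assumptions: a dictionary stands in for the two dimensional array, character codes serve as the initial names, positions are 0-based internally, undefined names are represented by None, and the pattern is assumed non-empty. The doubling steps and the final merging steps are interleaved by walking over the binary digits of m.

```python
def combine(left, right, r):
    """Names for length r + s from names of length r (left) and length s (right)."""
    size = len(left)
    table = {}                       # stands in for the l x l array
    keys = [None] * size
    for i in range(size):            # one simulated processor per position
        j = i + r
        if left[i] is not None and j < size and right[j] is not None:
            keys[i] = (left[i], right[j])
            table[keys[i]] = i       # concurrent write; here the last writer wins
    return [None if key is None else table[key] for key in keys]


def kmr_match(text: str, pattern: str) -> list[int]:
    """Return 1-based text positions that are m-equivalent to the pattern start."""
    s = text + pattern
    n, m = len(text), len(pattern)
    power, length = [ord(c) for c in s], 1    # names of substrings of length 2^j
    names, names_len = None, 0                # names of the length built so far
    bits = m
    while bits:
        if bits & 1:                          # this power of two is part of m
            if names is None:
                names, names_len = power, length
            else:
                names, names_len = combine(names, power, names_len), names_len + length
        bits >>= 1
        if bits:                              # doubling step
            power, length = combine(power, power, length), 2 * length
    target = names[n]                         # name of the pattern itself
    return [i + 1 for i in range(n - m + 1) if names[i] == target]
```

Two positions receive the same name exactly when the corresponding substrings are equal, so comparing names against the name at the pattern's starting index reports precisely the occurrences.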

This algorithm requires O(m²) space, which can be reduced to O(m^{1+ε})
for a time tradeoff as described in the suffix tree construction algorithm
of Apostolico et al. [6].

In  the  rest  of  this  section  we  will  describe  the  algorithm  of  Vishkin  [46], 
on  which  the  faster  algorithms,  described  later,  are  based. 

As we have seen before, if we have an nm-processor CRCW-PRAM, we can solve
the string matching problem in constant time using the following method:

•  First, mark all possible occurrences of the pattern as ‘true’.

•  To each such possible beginning of the pattern, assign m processors.
   Each processor compares one symbol of the pattern with the
   corresponding symbol of the text. If a mismatch is encountered, it
   marks the appropriate beginning as ‘false’.

Assuming that we can eliminate all but l of the possible occurrences, we
can use the method described above to get a constant time parallel
algorithm with lm processors. Both Galil [26] and Vishkin [46] use this
approach. The only problem is that one can have many occurrences of the
pattern in the text, even many more than the n/m needed for optimality in
the discussion above.

To overcome this problem, we introduce the notion of the period used in
these two papers. A string u is called a period of a string w if w is a
prefix of u^k for some positive integer k, or equivalently if w is a
prefix of uw. We call the shortest period of a string w the period of w.
For example, the period of the string abacabacaba is abac. The string
abacabac is also a period, and so is the string abacabacab.
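The definition can be checked directly on the example. The sketch below uses the second, equivalent form of the definition; the function names are our own and the search for the shortest period is deliberately brute force.

```python
def is_period(u: str, w: str) -> bool:
    """True if u is a period of w, i.e. w is a prefix of u followed by w."""
    return len(u) > 0 and (u + w).startswith(w)


def shortest_period(w: str) -> str:
    """Return the period of w: its shortest prefix that is a period."""
    for length in range(1, len(w) + 1):
        if is_period(w[:length], w):
            return w[:length]
    return w   # only reached for the empty string


example = "abacabacaba"
```

Here shortest_period(example) yields abac, and is_period confirms that abacabac and abacabacab are periods as well.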

Lemma 2.2 (Lyndon and Schützenberger [41]): If w has two periods of
length p and q and |w| ≥ p + q, then w has a period of length gcd(p, q).

If a pattern w occurs in positions i and j of some string and
0 < j − i < |w| then the occurrences must overlap. This implies that w has
a period of length j − i. Therefore, we cannot have occurrences of w at
positions i and j if 0 < j − i < |u| and u is the period of the pattern.
Clearly there are no more than n/|u| occurrences of the pattern in a
string of length n.

If the pattern is longer than twice its period length then instead of
matching the whole pattern w we look only for occurrences of u², its
period repeated twice. (Note that u² has the same period length as w by
Lemma 2.2.) This case where the pattern is longer than twice its period is
called the periodic case.

Assuming we could eliminate many of the occurrences of u² and have only
n/|u| possible occurrences left, we can use the constant-time algorithm
described above to verify these occurrences using only 2n processors.
Then, by counting the number of consecutive matches of u², we can match
the whole pattern.

Vishkin [46] shows how to count the consecutive matches in optimal
O(log m) time on an EREW-PRAM using ideas which are similar to prefix
computation. Breslauer and Galil [14] show how it can be done in constant
optimal time on a CRCW-PRAM (and thus in optimal O(log m) time on an
EREW-PRAM). Assume without loss of generality that the text is of length
n ≤ (3/2)m and the pattern is u^k v, where u is its period. Call an
occurrence of u² at position i an initial occurrence if there is no
occurrence of u² at position i − |u|, and a final occurrence if there is
no occurrence at position i + |u|. There is at most one initial occurrence
which can start an actual occurrence of the pattern: the rightmost initial
occurrence in the first m/2 positions. Any initial occurrence in a
position greater than m/2 cannot start an occurrence of the pattern since
the text is not long enough. Any initial occurrence on the left cannot
start an occurrence of the pattern since u, the period of the pattern, is
not repeated enough times. The corresponding final occurrence is the
leftmost final occurrence to the right of the initial occurrence. By
subtracting the position of the initial occurrence from that of the final
occurrence and verifying an occurrence of v starting after the final
occurrence, one can tell how many times the period is repeated and what
the actual occurrences of the pattern are.

For the rest of the description we assume without loss of generality that
the pattern is shorter than twice its period length, which is called the
non-periodic case.

Suppose u is the period of the pattern w. If we compare two copies of w
shifted with respect to each other by i positions for 0 < i < |u|, there
must be at least one mismatch. Vishkin [46] takes one of these mismatches
and calls it a witness to the fact that i is not a period length. More
formally, let k be the index of one such mismatch, then

PATTERN[k] ≠ PATTERN[k − i].

We call this k a witness, and define

WITNESS[i + 1] = k.

Using this witness information Vishkin suggests a method which he calls a
duel to eliminate at least one of two close possible occurrences. Suppose
i and j are possible occurrences and 0 < j − i < |u|. Then,
r = WITNESS[j − i + 1] is defined. Since PATTERN[r] ≠ PATTERN[r + i − j],
at most one of them is equal to TEXT[i + r − 1] (see Figure 2.1), and at
least one of the possible occurrences can be ruled out. (As in a real
duel, sometimes both can be ruled out.)


Figure 2.1. X ≠ Y and therefore we cannot have T = X and T = Y.
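A duel can be sketched as follows, keeping the 1-based positions of the text. The function name duel, its convention of returning the list of surviving candidates, and the sample pattern abaab with its hand-computed witness table are our own assumptions for illustration.

```python
def duel(text: str, pattern: str, witness: dict[int, int], i: int, j: int) -> list[int]:
    """Duel between candidate starts i < j (1-based); returns the survivors.

    Requires WITNESS[j - i + 1] to be defined, that is, j - i smaller than
    the period length, and both candidates to fit inside the text.
    """
    r = witness[j - i + 1]
    t = text[i + r - 2]              # TEXT[i + r - 1] in 1-based notation
    x = pattern[r - 1]               # PATTERN[r], facing t if the pattern starts at i
    y = pattern[r + i - j - 1]       # PATTERN[r + i - j], facing t if it starts at j
    survivors = []
    if t == x:
        survivors.append(i)
    if t == y:
        survivors.append(j)
    return survivors                 # x != y, so at most one candidate survives


# Witnesses for the pattern "abaab", whose period is "aba":
# shift 1: PATTERN[2] != PATTERN[1];  shift 2: PATTERN[4] != PATTERN[2].
WITNESS = {2: 2, 3: 4}
```

A surviving candidate is not yet known to be an occurrence; the duel only guarantees that the eliminated candidate is not one, at the cost of a single symbol comparison per candidate.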

Vishkin’s  algorithm  [46]  consists  of  two  phases.  The  first  is  the  pattern 
analysis  phase  in  which  the  witness  information  is  computed  to  help  later 
with  a  text  analysis  phase  which  finds  the  actual  occurrences. 

We start with a description of the text analysis phase. Let P = |u| be
the period length of the pattern. The pattern analysis phase described
later computes witnesses only for the first half of the pattern. If the
pattern has a period which is longer than half its length, we define
P = ⌈m/2⌉.

The text analysis phase works in stages. There are ⌊log P⌋ stages. At
stage i the text string is partitioned into consecutive blocks of length
2^i. Each such block has at most one possible start of an occurrence. We
start at stage 0 where the blocks are of size one, and each position of
the string is a possible occurrence.

At stage i, consider a block of size 2^{i+1} which consists of two blocks
of size 2^i. It has at most two possible occurrences of the pattern, one
in each block of size 2^i. A duel is performed between these two possible
occurrences, leaving at most one in the 2^{i+1} block.

At the end of ⌊log P⌋ stages, we are left with at most n/P possible
occurrences of the pattern, which can be verified in constant time using n
processors. Note that the total number of operations performed is O(n) and
the time is O(log m). By Theorem 1.1 an optimal implementation is
possible. Moreover, it is even possible to implement this phase on a
CREW-PRAM within the same time bound. It is the pattern analysis phase
which requires a CRCW-PRAM.
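The stages of the text analysis can be simulated sequentially as in the sketch below. It rests on assumptions of our own: the witnesses are found by brute force instead of the pattern analysis described next, the bound P is the period length capped at ⌈m/2⌉, positions are 1-based, and the final verification compares whole substrings. The duels of one stage are independent, so a PRAM would perform them simultaneously.

```python
def witnesses(pattern: str) -> tuple[dict[int, int], int]:
    """Brute-force WITNESS table for shifts below P, together with P itself."""
    m = len(pattern)
    limit = (m + 1) // 2                       # ceil(m / 2)
    wit = {}
    for shift in range(1, limit):
        k = next((k for k in range(shift + 1, m + 1)
                  if pattern[k - 1] != pattern[k - 1 - shift]), None)
        if k is None:
            return wit, shift                  # shift is the period length
        wit[shift + 1] = k                     # PATTERN[k] != PATTERN[k - shift]
    return wit, limit


def text_analysis(text: str, pattern: str) -> list[int]:
    """Return the 1-based occurrences of pattern found by rounds of duels."""
    n, m = len(text), len(pattern)
    wit, bound = witnesses(pattern)
    # Stage 0: blocks of size one, every position where the pattern fits.
    blocks = [[i] for i in range(1, n - m + 2)]
    size = 1
    while 2 * size <= bound:                   # floor(log2 bound) stages
        merged = []
        for b in range(0, len(blocks), 2):
            group = blocks[b] + (blocks[b + 1] if b + 1 < len(blocks) else [])
            if len(group) == 2:                # duel between the two candidates
                i, j = group
                r = wit[j - i + 1]
                t = text[i + r - 2]
                group = [c for c, sym in ((i, pattern[r - 1]),
                                          (j, pattern[r + i - j - 1])) if sym == t]
            merged.append(group)
        blocks, size = merged, 2 * size
    # Verify the surviving candidates against the whole pattern.
    return [c for group in blocks for c in group
            if text[c - 1:c - 1 + m] == pattern]
```

For example, with the text abaabaabab and the pattern abaab a single stage of duels leaves the candidates 1, 4 and 6, and the verification keeps 1 and 4.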

The pattern analysis phase is similar to the text analysis phase. It
takes ⌊log m⌋ stages. The description below outlines a logarithmic time
implementation using m processors.

The WITNESS array which we used in the text analysis phase is computed
incrementally. Knowing that some witnesses are already computed in
previous stages, one can easily compute more witnesses. Let i and j be two
indices in the pattern such that i < j ≤ ⌈m/2⌉. If
s = WITNESS[j − i + 1] is already computed then we can find at least one
of WITNESS[i] or WITNESS[j] using a duel on the pattern as follows:

•  If s + i − 1 ≤ m then s + i − 1 is also a witness either for i or for j.

•  If s + i − 1 > m then either s is a witness for j or s − j + i is a
   witness for i (see Figure 2.2).


Figure 2.2. X ≠ Y and therefore we cannot have Z = X and Z = Y.

The pattern analysis proceeds as follows: At stage i the pattern is
partitioned into consecutive blocks of size 2^i. Each block has at most
one yet-to-be-computed witness. The first block never has WITNESS[1]
computed. Consider the first block of size 2^{i+1}. It has at most one
other yet-to-be-computed witness, say WITNESS[k]. We first try to compute
this witness by comparing two copies of the pattern shifted with respect
to each other by k − 1 positions. This can be easily done in constant time
on an arbitrary CRCW-PRAM with m processors. If a witness is not found,
then k − 1 is the period length of the pattern and the pattern analysis
terminates. If a witness was found, a duel is performed in each block of
size 2^{i+1} between the two yet-to-be-computed witnesses in each such
block. It results in each block of size 2^{i+1} having at most one
yet-to-be-computed witness. After O(⌊log m⌋) stages the witness
information is computed for the first half of the pattern and the
algorithm can proceed with the text analysis.

The optimal implementation of the pattern analysis is very similar to
Galil's [26] original algorithm. Each iteration of the pattern analysis
described above has actually two steps: the first step tries to verify a
period length using a naive algorithm which compares long strings; if it
fails, a witness is found and it is used in a step which is identical to
the actual text analysis phase.

Suppose the naive algorithm would be applied at stage i just to verify a
period length of a prefix of the pattern of length 2^{i+1} instead of the
whole pattern. If a mismatch is found it can be used as a witness as
described before. If no mismatch has been found, continue to a periodic
stage i + 1 which will try to verify the same period length of a prefix of
double length. At some point either a mismatch is found or the period
length is verified for the whole string and the pattern analysis is
terminated. If a mismatch was found, it follows from Lemma 2.2 that the
first mismatch can be used as a witness value for all uncomputed witnesses
in the first block; and the algorithm can catch up to stage i + 1 (with
the current value of i) by performing duels.

Galil's [26] original algorithm had only one stage which consisted of two
similar steps: application of a naive algorithm to verify a period length
of a prefix of the pattern of increasing length, and elimination of close
possible occurrences which would imply a short period length. The main
difference is that Galil's algorithm had to compare long strings also in
the steps where Vishkin's algorithm uses the witness information. So n
operations are performed at each round, making the algorithm non-optimal.
Galil [26] suggests an improvement for a finite alphabet which packs
O(log m) symbols in a single integer and thus uses fewer processors to
perform the comparisons, making an optimal implementation possible in
O(log m) time.

3  An  O(log log m)  time  algorithm

The O(log log m) time algorithm of Breslauer and Galil [14] is similar to
Vishkin's algorithm from the previous section. The method is also based on
an algorithm for finding the maximum suggested by Valiant [45] for a
comparison model and implemented by Shiloach and Vishkin [44] on a
CRCW-PRAM.

As before, we have two stages. The first stage, the pattern analysis,
computes the witness information which is used in the text analysis to find
the actual occurrences. Let P = |u| be the period length of the pattern. As
before, if the pattern has a period length longer than half its length, we
define

$P = \lceil m/2 \rceil$.

Partition the text into blocks of size P and consider each one separately.
In each block consider each position as a possible occurrence. Assuming we
had $P^2$ processors for each such block, a duel could be made between all
pairs of possible occurrences, leaving at most one possible occurrence in
each block. Since we have only n processors, partition each block into
groups of size $\sqrt{P}$ and repeat recursively. The recursion bottoms out
with one processor per block of size 1. When done we are left with at most
one possible occurrence in each group of size $\sqrt{P}$, thus $\sqrt{P}$
possible occurrences altogether. Then in constant time make all duels as
described above. We are left with a single possible occurrence (or none) in
each block of size P and proceed with counting the consecutive occurrences
of the period as described in Section 2.
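
A sequential sketch of this recursion may help. It is an illustration
only: the names witness_table, duel, survivor and find_occurrences are
introduced here, indices are 0-based, the witnesses are computed naively,
and the pattern is assumed to have a period longer than half its length so
that every shift smaller than ceil(m/2) has a witness. Loops stand in for
work that the parallel algorithm does simultaneously.

```python
import math


def witness_table(pattern):
    """wit[d] = an index h with pattern[h] != pattern[h + d], 0 < d < ceil(m/2)."""
    m = len(pattern)
    half = (m + 1) // 2
    return {d: next(h for h in range(m - d) if pattern[h] != pattern[h + d])
            for d in range(1, half)}


def duel(text, pattern, wit, i, j):
    """Return the survivor of two candidate positions i < j (at most one is real)."""
    h = wit[j - i]
    pos = j + h  # an occurrence at j needs pattern[h] here, one at i needs pattern[h + j - i]
    if pos < len(text) and text[pos] == pattern[h]:
        return j
    return i


def survivor(cands, text, pattern, wit):
    """Leave at most one candidate, recursing on groups of about sqrt(len) candidates."""
    if len(cands) <= 3:
        left = cands
    else:
        g = math.isqrt(len(cands))
        left = [c for s in range(0, len(cands), g)
                for c in survivor(cands[s:s + g], text, pattern, wit)]
    # All-pairs duels among the group survivors: keep one that wins every duel.
    return [c for c in left
            if all(duel(text, pattern, wit, min(c, o), max(c, o)) == c
                   for o in left if o != c)]


def find_occurrences(text, pattern):
    m, n = len(pattern), len(text)
    wit = witness_table(pattern)
    block = (m + 1) // 2  # block size P = ceil(m/2)
    out = []
    for start in range(0, n - m + 1, block):
        cands = list(range(start, min(start + block, n - m + 1)))
        for c in survivor(cands, text, pattern, wit):
            if text[c:c + m] == pattern:  # final naive verification
                out.append(c)
    return out


print(find_occurrences("abcabcabxabcab", "abcab"))
```

A real occurrence wins every duel it takes part in, so it is never
eliminated, and since every duel has a loser at most one candidate per block
reaches the final verification.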

To make the text analysis run in optimal O(log log m) time we start with
an O(log log P) time sequential algorithm which runs in parallel in all
sub-blocks of length log log P, leaving only P / log log P possible
occurrences in each block by performing duels. Then proceed with the
procedure above starting with the reduced number of possible occurrences.

The pattern analysis can also be done in optimal O(log log m) time. We
describe here only an m processor algorithm. It works in stages and it
takes at most log log m stages. Let $k_i = m^{1-2^{-i}}$, $k_0 = 1$. At the
end of stage i, we have at most one yet-to-be-computed witness in each
block of size $k_i$. The only yet-to-be-computed index in the first block
is 1.

1. At the beginning of stage i we have at most $k_i/k_{i-1}$
yet-to-be-computed witnesses in the first $k_i$-block. Try to compute them
using the naive algorithm on PATTERN[1..$2k_i$] only. This takes constant
time using $2k_i \cdot k_i/k_{i-1} = 2m$ processors.

2. If we succeed in producing witnesses for all the indices in the first
block (all but the first, for which there is no witness), compute witnesses
in each following block of the same size using the optimal duel algorithm
described above for the text processing. This takes O(log log m) time only
for the first stage. In the following stages, we will have at most
$\sqrt{m}$ indices for which we have no witness, and duels can be done in
O(1) time.

3. If we fail to produce a witness for some $2 \le j \le k_i$, it follows
that PATTERN[1..$2k_i$] is periodic with period length p, where p = j - 1
and j is the smallest index of a yet-to-be-computed witness. By Lemma 2.1
all yet-to-be-computed indices within the first block are of the form
kp + 1. Check periodicity with period length p to the end of the pattern.
If p turns out to be the length of the period of the pattern, the pattern
analysis is done and we can proceed with the text analysis. Otherwise, the
smallest witness found is good also for all the indices of the form kp + 1
which are in the first $k_i$-block, and we can proceed with the duels as
in 2.
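
The processor count in step 1 and the number of stages can be checked
numerically. The sketch below assumes block sizes $k_i = m^{1-2^{-i}}$ with
$k_0 = 1$ (the choice for which $2k_i \cdot k_i/k_{i-1}$ equals 2m) and, to
keep the arithmetic exact, takes m of the form $2^{2^t}$; the function name
block_sizes is introduced here for illustration.

```python
def block_sizes(t):
    """k_i = m^(1 - 2^-i) for m = 2^(2^t) and i = 0..t, as exact integers."""
    log_m = 2 ** t
    return [2 ** (log_m - (log_m >> i)) for i in range(t + 1)]


t = 4                      # m = 2^16, so log log m = 4
m = 2 ** (2 ** t)
ks = block_sizes(t)
for i in range(1, len(ks)):
    # 2*k_i pattern positions for each of the k_i/k_(i-1) missing witnesses
    processors = 2 * ks[i] * (ks[i] // ks[i - 1])
    print(i, ks[i], processors == 2 * m)
```

After t = log log m stages the block size reaches m/2, which covers every
index that needs a witness.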

If p processors are available and $m \le p \le m^2$, this algorithm can be
modified to work in $O(\log\log_{\lceil 1+p/m \rceil} 2p)$ time. If the
number of processors is smaller than m / log log m, the algorithm can be
slowed down to work in O(m/p) time.

When the number of processors is larger than $m^2$ the naive algorithm
solves the problem in constant time. All these bounds can be summarized in
one expression: $O(\lceil m/p \rceil + \log\log_{\lceil 1+p/m \rceil} 2p)$.
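
To see how the combined bound behaves across processor counts, it can be
evaluated directly. The sketch assumes the expression reads
ceil(m/p) + log log_{ceil(1 + p/m)} 2p with base-2 outer logarithm and
ignores the constant hidden by the O-notation; the function name
rounds_bound is introduced here.

```python
import math


def rounds_bound(m, p):
    """ceil(m/p) + log2(log_base(2p)) with base = ceil(1 + p/m)."""
    base = math.ceil(1 + p / m)          # at least 2 for any p >= 1
    return math.ceil(m / p) + math.log2(math.log(2 * p, base))


m = 2 ** 16
for p in (m // 16, m, m * m):
    print(p, round(rounds_bound(m, p), 2))
```

With few processors the m/p term dominates, with p = m the value is close
to log log m, and with p = m^2 it drops to a constant.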


4  A  lower  bound 

In this section we describe the lower bound of Breslauer and Galil [15]
for a model which is similar to Valiant's parallel comparison tree model
[45]. We assume the only access the algorithm has to the input strings is
by comparisons which check whether two symbols are equal or not. The
algorithm is allowed m comparisons in each round, after which it can
proceed to the next round or terminate with the answer. We give a lower
bound on the minimum number of rounds necessary in the worst case.

Consider a CRCW-PRAM that solves the string matching problem over a
general alphabet. In this case the PRAM can perform comparisons, but not
computation, with its input symbols. Thus, its execution can be partitioned
into comparison rounds followed by computation rounds. Therefore, a lower
bound for the number of rounds in the parallel comparison model immediately
translates into a lower bound for the time of the CRCW-PRAM. If the pattern
is given in advance and any preprocessing is free, then this lower bound
does not hold, as Vishkin's O(log* m) algorithm shows. The lower bound also
does not hold for a CRCW-PRAM over fixed-alphabet strings. Similarly,
finding the maximum in the parallel decision tree model has exactly the
same lower bound [45], but for small integers the maximum can be found in
constant time on a CRCW-PRAM [25].

We start by proving a lower bound for a related problem of finding the
period length of a string. Given a string S[1..m] we prove that
$\Omega(\log\log m)$ rounds are necessary for determining whether it has a
period length smaller than m/2. Later we show how this lower bound
translates into a lower bound for string matching.

We show a strategy for an adversary to answer $\frac{1}{4}\log\log m$
rounds of comparisons after which it still has the choice of fixing the
input string S in two ways: in one the resulting string has a period of
length smaller than m/2 and in the other it does not have any such period.
This implies that any algorithm which terminates in fewer rounds can be
fooled.

We say that an integer k is a possible period length if we can fix S
consistently with answers to previous comparisons in such a way that k is a
period length of S. For such k to be a period length we need each residue
class modulo k to be fixed to the same symbol, thus if
$l \equiv j \pmod{k}$ then S[l] = S[j].
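
The residue-class condition is easy to state in code. The following is a
minimal sketch with 0-based positions; the function name is_period_length
is introduced here for illustration.

```python
def is_period_length(s, k):
    """True if all positions in each residue class modulo k hold one symbol."""
    seen = {}
    for pos, sym in enumerate(s):
        # setdefault records the first symbol seen in the class pos % k
        if seen.setdefault(pos % k, sym) != sym:
            return False
    return True


print(is_period_length("abcabcab", 3), is_period_length("abcabcab", 2))
```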

At the beginning of round i the adversary will maintain an integer $k_i$
which is a possible period length. The adversary answers the comparisons
of round i in such a way that some $k_{i+1}$ is a possible period length
and few symbols of S are fixed. Let $K_i = m^{1-4^{-(i-1)}}$. The adversary
will maintain the following invariants which hold at the beginning of round
number i:

1. $k_i$ satisfies $\frac{1}{2}K_i \le k_i \le K_i$.

2. If S[l] was fixed then for every $j \equiv l \pmod{k_i}$, S[j] was fixed
to the same symbol.




3. If a comparison was answered as equal then both symbols compared
were fixed to the same value.

4.  If  a  comparison  was  answered  as  unequal,  then 

a.  it  was  between  different  residues  modulo  ki; 

b.  if  the  symbols  were  fixed  then  they  were  fixed  to  different  values. 

5. The number of fixed symbols $f_i$ satisfies $f_i \le K_i$.

Note that invariants 3 and 4 imply consistency of the answers given so far.
Invariants 2, 3 and 4 imply that $k_i$ is a possible period length: if we
fix all symbols in each unfixed residue class modulo $k_i$ to a new symbol,
a different symbol for different residue classes, we obtain a string
consistent with the answers given so far that has a period length $k_i$.

We start at round number 1 with $k_1 = K_1 = 1$. It is easy to see that
the invariants hold initially. We show how to answer the comparisons of
round i and how to choose $k_{i+1}$ so that the invariants still hold. All
multiples of $k_i$ in the range $\frac{1}{2}K_{i+1} \ldots K_{i+1}$ are
candidates for the new $k_{i+1}$. A comparison S[l] = S[j] must be answered
as equal if $l \equiv j \pmod{k_{i+1}}$. We say that $k_{i+1}$ forces this
comparison.

Theorem 4.1 (see [43]): For large enough n, the number of primes between
1 and n, denoted by $\pi(n)$, satisfies
$\frac{n}{\ln n} \le \pi(n) \le \frac{5}{4}\frac{n}{\ln n}$.

Corollary: The number of primes between $\frac{1}{2}n$ and n is greater
than $\frac{1}{4}\frac{n}{\ln n}$.

Lemma 4.2: If $p, q \ge \sqrt{m/k_i}$ and are relatively prime, then a
comparison S[l] = S[k] is forced by at most one of $pk_i$ and $qk_i$.

Proof: Assume $l \equiv k \pmod{pk_i}$ and $l \equiv k \pmod{qk_i}$ for
$1 \le l, k \le m$ with $l \ne k$. Then also $l \equiv k \pmod{pqk_i}$. But
$pqk_i \ge m$ and $1 \le l, k \le m$, which implies l = k, a
contradiction. □
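
A brute-force check makes the role of the size condition visible. This is a
sketch only: the function name forced_by_both is introduced here, and the
condition on p and q is assumed to be that both are at least sqrt(m/k), so
that p*q*k is at least m as the proof requires.

```python
def forced_by_both(m, k, p, q):
    """Pairs l < j in 1..m that are congruent modulo both p*k and q*k."""
    return [(l, j)
            for l in range(1, m + 1)
            for j in range(l + 1, m + 1)
            if (j - l) % (p * k) == 0 and (j - l) % (q * k) == 0]


# p = 11 and q = 13 are coprime and both at least sqrt(100): nothing is forced twice.
print(forced_by_both(100, 1, 11, 13))
# p = 3 and q = 5 are coprime but too small: positions 15 apart are forced by both.
print(forced_by_both(100, 1, 3, 5)[:3])
```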

Lemma 4.3: The number of candidates for $k_{i+1}$ which are prime multiples
of $k_i$ and satisfy $\frac{1}{2}K_{i+1} \le k_{i+1} \le K_{i+1}$ is greater
than $\frac{K_{i+1}}{4K_i\log m}$. Each such candidate satisfies the
condition of Lemma 4.2.

Proof: These candidates are of the form $pk_i$ for prime p. The number of
such prime values of p can be estimated using the corollary to Theorem 4.1.
It is at least

$$\frac{1}{4}\cdot\frac{K_{i+1}/k_i}{\log(K_{i+1}/k_i)} \ge \frac{K_{i+1}}{4K_i\log m}.$$

Each one of these candidates also satisfies the condition of Lemma 4.2
since $k_i \le K_i$, $pk_i \ge \frac{1}{2}K_{i+1}$ and, for m large enough,

$$p \ge \frac{K_{i+1}}{2k_i} \ge \frac{K_{i+1}}{2K_i} = \frac{1}{2}m^{3\cdot 4^{-i}} \ge \sqrt{\frac{2m}{K_i}} \ge \sqrt{\frac{m}{k_i}}.$$ □


Lemma 4.4: There exists a candidate for $k_{i+1}$ in the range
$\frac{1}{2}K_{i+1} \ldots K_{i+1}$ that forces at most
$\frac{4mK_i\log m}{K_{i+1}}$ comparisons.

Proof: By Lemma 4.3 there are at least $\frac{K_{i+1}}{4K_i\log m}$ such
candidates which are prime multiples of $k_i$ and satisfy the condition of
Lemma 4.2. By Lemma 4.2 each of the m comparisons is forced by at most one
of them. So the total number of comparisons forced by all these candidates
is at most m. Thus, there is a candidate that forces at most
$\frac{4mK_i\log m}{K_{i+1}}$ comparisons. □

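Lemma 4.5: For m large enough and $i \le \frac{1}{4}\log\log m$,
$1 + m^{2\cdot 4^{-i}}\,16\log m \le m^{3\cdot 4^{-i}}$.

Proof: For m large enough,

$$\log\log(1 + 16\log m) \le \tfrac{1}{2}\log\log m = \left(1 - \tfrac{2}{4}\right)\log\log m,$$

$$\log(1 + 16\log m) \le 4^{-\frac{1}{4}\log\log m}\log m,$$

$$1 + 16\log m \le m^{4^{-\frac{1}{4}\log\log m}} \le m^{4^{-i}},$$

from which the lemma follows. □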

Lemma 4.6: Assume the invariants hold at the beginning of round i and
the adversary chooses $k_{i+1}$ to be a candidate which forces at most
$\frac{4mK_i\log m}{K_{i+1}}$ comparisons. Then the adversary can answer the
comparisons in round i so that the invariants also hold at the beginning of
round i + 1.

Proof: By Lemma 4.4 such $k_{i+1}$ exists. For each comparison that is
forced by $k_{i+1}$ and is of the form S[l] = S[j] where
$l \equiv j \pmod{k_{i+1}}$ the adversary fixes the residue class modulo
$k_{i+1}$ to the same new symbol (a different symbol for different residue
classes). The adversary answers comparisons between fixed symbols based on
the value they are fixed to. All other comparisons involve two positions in
different residue classes modulo $k_{i+1}$ (and at least one unfixed
symbol) and are always answered as unequal.

Since $k_{i+1}$ is a multiple of $k_i$, the residue classes modulo $k_i$
split; each class splits into residue classes modulo $k_{i+1}$. Note that
if two indices are in different residue classes modulo $k_i$, then they are
also in different residue classes modulo $k_{i+1}$; if two indices are in
the same residue class modulo $k_{i+1}$, then they are also in the same
residue class modulo $k_i$.

We  show  that  the  invariants  still  hold. 

1. The candidate we chose for $k_{i+1}$ was in the required range.

2. Residue classes which were fixed before split into several residue
classes, all of which are fixed. Any symbol fixed at this round causes its
entire residue class modulo $k_{i+1}$ to be fixed to the same symbol.

3. Equal answers of previous rounds are not affected since the symbols
involved were fixed to the same value by the invariants that held before.
Equal answers of this round are either between symbols which were fixed
before to the same value or are within the same residue class modulo
$k_{i+1}$ and the entire residue class is fixed to the same value.

4. a. Unequal answers of previous rounds are between different residue
classes modulo $k_{i+1}$ since residue classes modulo $k_i$ split. Unequal
answers of this round are between different residue classes because
comparisons within the same residue class modulo $k_{i+1}$ are always
answered as equal.

b. Unequal answers which involve symbols which were fixed before this
round are consistent because fixed values dictate the answers to the
comparisons. Unequal answers which involve symbols that are fixed at the
end of this round, at least one of which was fixed at this round, are
consistent since a new symbol is used for each residue class fixed.

5. We prove inductively that $f_{i+1} \le K_{i+1}$. We fix at most
$\frac{4mK_i\log m}{K_{i+1}}$ residue classes modulo $k_{i+1}$. There are
$k_{i+1}$ such classes and each class has at most
$\lceil m/k_{i+1} \rceil \le \frac{2m}{k_{i+1}}$ elements. By Lemma 4.5 and
simple algebra the number of fixed elements satisfies:

$$f_{i+1} \le f_i + \frac{2m}{k_{i+1}}\cdot\frac{4mK_i\log m}{K_{i+1}} \le K_i\left(1 + \left(\frac{m}{K_{i+1}}\right)^2 16\log m\right) = m^{1-4^{-(i-1)}}\left(1 + m^{2\cdot 4^{-i}}\,16\log m\right) \le m^{1-4^{-i}} = K_{i+1}.$$ □

Theorem 4.7: Any comparison-based parallel algorithm for finding the
period length of a string S[1..m] using m comparisons in each round
requires $\frac{1}{4}\log\log m$ rounds.

Proof: Fix an algorithm which finds the period of S and let the adversary
described above answer the comparisons. After $i = \frac{1}{4}\log\log m$
rounds, $f_{i+1}, k_{i+1} \le K_{i+1} = m^{1-4^{-\frac{1}{4}\log\log m}} \le \frac{m}{2}$.
The adversary can still fix S to have a period length $k_{i+1}$ by fixing
each remaining residue class modulo $k_{i+1}$ to the same symbol, a
different symbol for each class. Alternatively, the adversary can fix all
unfixed symbols to different symbols. Note that this choice is consistent
with all the comparisons answered so far by invariants 3 and 4, and the
string does not have any period length smaller than m/2. Consequently, any
algorithm which terminates in fewer than $\frac{1}{4}\log\log m$ rounds can
be fooled. □

Theorem 4.8: The lower bound holds also for any comparison-based string
matching algorithm when n = O(m).

Proof: Fix a string matching algorithm. We present to the algorithm a
pattern P[1..m] which is S[1..m] and a text T[1..2m-1] which is S[2..2m],
where S is a string of length 2m generated by the adversary in the way
described above. (We use the same adversary that we used in the previous
proof; the adversary sees all comparisons as comparisons between symbols in
S.) After $\frac{1}{4}\log\log 2m$ rounds the adversary still has the choice
of fixing S to have a period length smaller than m, in which case we will
have an occurrence of P in T, or to fix all unfixed symbols to completely
different characters, which implies that there would be no such occurrence.
Thus, the lower bound holds also for any such string matching algorithm. □

This lower bound actually holds even if the algorithm is allowed to
perform order comparisons, which can result in a less than, equal or
greater than answer, as shown in Breslauer and Galil's paper [15]. When the
number of comparisons in each round is p and n = O(m), the bound becomes
$\Omega(\lceil m/p \rceil + \log\log_{\lceil 1+p/m \rceil} 2p)$, matching
the upper bound.


5  A  faster  algorithm 

The fast string matching algorithm of Vishkin [47] has two stages. The
pattern analysis stage is slow and takes optimal
$O(\log^2 m/\log\log m)$ time while the text analysis is very fast and
works in optimal O(log* m) time. An alternative randomized implementation
of the pattern analysis that works in optimal O(log m) time with very high
probability will not be covered in this paper.

As we have seen before, we can assume without loss of generality that
the pattern is shorter than twice its period length. Thus witnesses can be
computed for all indices which are smaller than m/2.

Definition: A deterministic sample DS = [ds(1), ds(2), ..., ds(l)] is a set
of positions of the pattern string such that if the pattern is aligned at
position i of the text and the symbols at positions ds(1), ..., ds(l) of
the pattern are verified, that is PATTERN[ds(j)] = TEXT[i + ds(j) - 1] for
$1 \le j \le l$, then i is the only possible occurrence of the pattern in
an interval of length m/2 around i.

Deterministic samples are useful since one can always find a small one.

Lemma 5.1: For any pattern of length m, a deterministic sample of size
log m - 1 exists.

Proof: We show how to find a deterministic sample of length log m - 1. If
this sample is verified for position i of the text then i is the only
possible occurrence in an interval of length m/2 around i.

Consider m/2 copies of the pattern placed under each other, each shifted
ahead by one position with respect to the previous one. Thus copy number
k is aligned at position k of copy number one. Call the symbols of all
copies aligned over position number i of the first copy column i (see
Figure 5.1). Since we assume that the pattern is shorter than twice its
period length and there are m/2 copies, for any two copies there is a
witness to their mismatch.

Figure 5.1. Aligned copies of the pattern and a column i.

Take the first and last copies and a witness to their mismatch. The
column of the mismatch has at least two different symbols and thus one of
the symbols in that column, in either the first or the last copy, appears
in the column in at most half of the copies. Keep only the copies which
have the same symbol in that column to get a set of at most half the number
of original copies, which all have the same symbol at the witness column.
This procedure can be repeated at most log m - 1 times until there is a
single copy left, say copy number k. Note that all columns chosen hit copy
number k. The deterministic sample is the indices in copy number k of the
columns considered. There are at most log m - 1 such columns. If this
sample is verified for position i of a text string no other occurrence at
positions i - k + 1 to i - k + m/2 is possible. □
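
The constructive proof translates into a short sequential procedure. The
sketch below is illustrative only: the names deterministic_sample and
sample_matches are introduced here, copies and positions are 0-based (so
copy c is the pattern shifted right by c), witnesses are found naively, and
the pattern is assumed to have length at least 2 and a period longer than
half its length.

```python
def deterministic_sample(pattern):
    """Return (k, sample): surviving copy k and the sample positions in it."""
    m = len(pattern)
    alive = list(range(m // 2))          # the m/2 shifted copies, kept sorted
    columns = []
    while len(alive) > 1:
        first, last = alive[0], alive[-1]
        d = last - first
        # Witness to the mismatch between the first and the last copy.
        h = next(h for h in range(m - d) if pattern[h] != pattern[h + d])
        x = last + h                     # witness column; it hits every live copy
        with_first = [c for c in alive if pattern[x - c] == pattern[x - first]]
        with_last = [c for c in alive if pattern[x - c] == pattern[x - last]]
        # Keep the copies carrying the rarer of the two symbols (at most half).
        alive = min(with_first, with_last, key=len)
        columns.append(x)
    k = alive[0]
    return k, sorted(x - k for x in columns)


def sample_matches(text, pattern, sample, i):
    """True if the sample is verified with copy k aligned at text position i."""
    return all(i + s < len(text) and text[i + s] == pattern[s] for s in sample)


k, sample = deterministic_sample("abababac")
print(k, sample)
```

If sample_matches holds at text position i, then none of the other copies,
which would start at positions i - k + c for c different from k, can be an
occurrence, because each eliminated copy disagrees with copy k in one of the
chosen columns.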

One can find such a deterministic sample in parallel by the constructive
proof of Lemma 5.1. Assume the witness information is produced by either
Vishkin's [46] or Breslauer and Galil's algorithm [14]. (It does not matter
which algorithm since the time bound is dominated by the following steps.)
There are O(log m) steps in the construction. Each step counts how many
symbols are equal to the witness symbol in the first and last copies, and
can be implemented using Theorem 2.1 in optimal O(log m) time, or even
faster by an algorithm of Cole and Vishkin [19] for prefix sums of small
integers which works in optimal $O(\log m/\log\log m)$ time.

Since the sums are taken at each round only for copies which are left, the
total number of operations performed at each round is half the number of
operations of the previous round and sums up over all rounds to be linear.
By Brent's Theorem the total time required for the pattern analysis is
$O(\log^2 m/\log\log m)$ with an optimal number of processors.

One can use the deterministic sample to find all occurrences of the pattern
in a text string in constant time with O(n log m) processors: for each
position verify the deterministic sample for that position, resulting in a
few possible occurrences which can be verified in constant time using a
linear number of processors.

We now show how to use the data structure constructed in the pattern
analysis phase to search for all occurrences of the pattern starting in any
position of a block of size m/2 of the text. We describe only an
O(log* m) version using m processors. An optimal implementation can be
obtained using standard techniques.

Assume that the output of the pattern analysis phase is a sequence of
compact arrays $A_0, A_1, \ldots, A_l$ where
$A_0 = \{-k+1, \ldots, m/2-k\}$ is the set of all copies of the pattern
considered at the start of the construction of the deterministic sample
(relative to k, the copy that survived) and $A_i \subset A_0$ is the set of
all copies remaining at the end of step i. These compact arrays can be
generated within the same bounds as the pattern analysis described above
and are used to efficiently assign processors to their tasks.

Initially  all  positions  in  the  block  are  candidates  for  a  potential  occurrence 
and  after  each  stage  only  part  of  the  candidates  will  survive. 

The algorithm starts with verifying ds(1) for each candidate. This takes
constant time using m processors. Call a candidate for which there is a
match a matching candidate. Let l and r be the indices of the leftmost and
rightmost matching candidates respectively. Consider $A_1$ as a template
around l and around r of all possible occurrences which have the same
symbol in the column under ds(1) relative to l or r. Note that all other
positions, even matching candidates (for which ds(1) was just verified),
cannot be real occurrences since they would disagree with the verified
ds(1) for position l or r. (The two templates cover the m/2 text positions
under consideration, because there are no occurrences before l or after r
and $l + m/2 - k \ge r - k + 1$.) Thus, the candidates that survive for the
next stage are those among the matching candidates aligned with a position
in $A_1$ relative to l or r.

We can continue in this manner. In stage i there will be a set of
candidates. The leftmost is l and the rightmost r and the set of candidates
is aligned with the subset of $A_i$ relative to l or r for which
ds(1), ..., ds(i) have been verified. (We described stage 0.) However this
will take log m stages. Note that in the second stage we have at most m/2
candidates. So we can achieve double progress with the same processors: in
stage 1 we can verify ds(2) and ds(3).

At the start of a general stage, assume the leftmost and rightmost
candidates are l and r and the candidates are positions aligned with
elements of $A_s$ (relative to l or r) for which ds(1), ..., ds(s) have
been verified. Since $|A_s| \le m/2^{s+1}$, the m processors now verify
ds(s + 1), ..., ds(s + $2^s$) for all the candidates. For the purpose of
efficient processor assignment they will be assigned to verify $2^s$
positions for all the elements in $A_s$. Those assigned to a non-candidate
simply do nothing. As above, we define matching candidates as the
candidates for which all positions were verified as matches, l and r as the
leftmost and rightmost matching candidates, and the candidates surviving
for the next stage are the matching candidates that are aligned with
$A_{s+2^s}$ (relative to l or r). Since the new value of s is larger than
$2^s$, we have at most O(log* m) stages, each of which takes constant time.
The number of processors is m.
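
The growth of s explains the O(log* m) bound and can be traced with a few
lines. The sketch assumes the update s -> s + 2^s per stage, starts from
s = 1 after stage 0, and uses about log m sample positions as an upper
estimate of the sample size; the function name cascade_stages is introduced
here.

```python
def cascade_stages(m):
    """Number of stages until about log2(m) sample positions are verified."""
    target = max(1, m.bit_length() - 1)  # roughly log2(m), an upper estimate
    s, stages = 1, 1                     # stage 0 verifies ds(1)
    while s < target:
        s += 2 ** s                      # 2^s further positions per candidate
        stages += 1
    return stages


for exponent in (10, 16, 2000):
    print(exponent, cascade_stages(2 ** exponent))
```

The values of s run 1, 3, 11, 2059, ..., so even a pattern of length
2^2000 needs only four stages in this count.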

This exponential acceleration phenomenon was called the accelerating
cascade design principle by Cole and Vishkin [18] where, by carefully
choosing the parameters, they were able to get an optimal O(log* m) time
parallel algorithm for another problem. For the complete description of the
algorithm see Vishkin's paper [47].

6  Open  questions 

•  String matching over a fixed alphabet. The lower bound of Section 4
assumes the input strings are drawn from a general alphabet and the
only access to them is by comparisons. The lower and upper bounds
for the string matching problem over a general alphabet are identical
to those for the comparison-based maximum finding algorithm obtained
by Valiant [45]. A constant time algorithm can find the maximum of
integers in a restricted range [25], which suggests the possibility of a
faster string matching algorithm.

•  Faster randomized algorithm. The similarity to the maximum finding
algorithm and the existence of a constant expected time randomized
algorithm for that problem suggest the possibility of a faster randomized
string matching algorithm.

•  String matching with long text strings. If the text string is much longer
than the pattern, the lower bound of Section 4 does not apply. Indeed,
on a comparison model where all computation is free one can do the
preprocessing for Vishkin's fast algorithm in constant time using $m^2$
processors. If $n = m^2$ the n processors are available to preprocess the
short pattern. However, it is not known if the preprocessing can be
performed on a CRCW-PRAM, or how it can be done faster with fewer
than $m^2$ processors on a comparison model.

•  String matching with preprocessing. What are the exact bounds if
preprocessing is free, as in Vishkin's fast algorithm? A constant time
optimal algorithm is still possible.

•  String matching on CREW and EREW-PRAM. The fastest optimal
CREW-PRAM deterministic algorithm is obtained by slowing down
the CRCW-PRAM algorithm to O(log m log log m) time. What is the
exact bound on these models?